In this markdown we will use the Diabetes Health Indicators Dataset to predict whether a person has diabetes based on their health condition.
First we load all the required libraries in the section below; we will answer the questions afterwards.

library(ggplot2)
library(vioplot)
## Warning: package 'vioplot' was built under R version 4.3.1
## Loading required package: sm
## Warning: package 'sm' was built under R version 4.3.1
## Package 'sm', version 2.2-5.7: type help(sm) for summary information
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.3.1
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.3.1
library(GGally)
## Warning: package 'GGally' was built under R version 4.3.1
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(psych)
## Warning: package 'psych' was built under R version 4.3.1
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(caTools)
## Warning: package 'caTools' was built under R version 4.3.1
library(ROCR)
## Warning: package 'ROCR' was built under R version 4.3.1
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.1
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:psych':
## 
##     outlier
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(xgboost)
## Warning: package 'xgboost' was built under R version 4.3.1
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.1


Now we load the data.

data1 <- read.csv("./datasets/diabetes_binary_health_indicators_BRFSS2015.csv")
data2 <- read.csv("./datasets/diabetes_binary_5050split_health_indicators_BRFSS2015.csv")
data3 <- read.csv("./datasets/diabetes_012_health_indicators_BRFSS2015.csv")

Throughout this markdown we will check the performance of our models on all 3 datasets.
Now let's take a look at the data!

print(paste("dimension of the 1st dataset: " ,dim(data1)))
## [1] "dimension of the 1st dataset:  253680"
## [2] "dimension of the 1st dataset:  22"
print("first rows of the 1st dataset")
## [1] "first rows of the 1st dataset"
head(data1)
##   Diabetes_binary HighBP HighChol CholCheck BMI Smoker Stroke
## 1               0      1        1         1  40      1      0
## 2               0      0        0         0  25      1      0
## 3               0      1        1         1  28      0      0
## 4               0      1        0         1  27      0      0
## 5               0      1        1         1  24      0      0
## 6               0      1        1         1  25      1      0
##   HeartDiseaseorAttack PhysActivity Fruits Veggies HvyAlcoholConsump
## 1                    0            0      0       1                 0
## 2                    0            1      0       0                 0
## 3                    0            0      1       0                 0
## 4                    0            1      1       1                 0
## 5                    0            1      1       1                 0
## 6                    0            1      1       1                 0
##   AnyHealthcare NoDocbcCost GenHlth MentHlth PhysHlth DiffWalk Sex Age
## 1             1           0       5       18       15        1   0   9
## 2             0           1       3        0        0        0   0   7
## 3             1           1       5       30       30        1   0   9
## 4             1           0       2        0        0        0   0  11
## 5             1           0       2        3        0        0   0  11
## 6             1           0       2        0        2        0   1  10
##   Education Income
## 1         4      3
## 2         6      1
## 3         4      8
## 4         3      6
## 5         5      4
## 6         6      8
print("number of unique diabetes/non-diabetes occurrences")
## [1] "number of unique diabetes/non-diabetes occurrences"
table(data1$Diabetes_binary)
## 
##      0      1 
## 218334  35346
print(paste("dimension of the 2nd dataset: ", dim(data2)))
## [1] "dimension of the 2nd dataset:  70692"
## [2] "dimension of the 2nd dataset:  22"
print("first rows of the 2nd dataset")
## [1] "first rows of the 2nd dataset"
head(data2)
##   Diabetes_binary HighBP HighChol CholCheck BMI Smoker Stroke
## 1               0      1        0         1  26      0      0
## 2               0      1        1         1  26      1      1
## 3               0      0        0         1  26      0      0
## 4               0      1        1         1  28      1      0
## 5               0      0        0         1  29      1      0
## 6               0      0        0         1  18      0      0
##   HeartDiseaseorAttack PhysActivity Fruits Veggies HvyAlcoholConsump
## 1                    0            1      0       1                 0
## 2                    0            0      1       0                 0
## 3                    0            1      1       1                 0
## 4                    0            1      1       1                 0
## 5                    0            1      1       1                 0
## 6                    0            1      1       1                 0
##   AnyHealthcare NoDocbcCost GenHlth MentHlth PhysHlth DiffWalk Sex Age
## 1             1           0       3        5       30        0   1   4
## 2             1           0       3        0        0        0   1  12
## 3             1           0       1        0       10        0   1  13
## 4             1           0       3        0        3        0   1  11
## 5             1           0       2        0        0        0   0   8
## 6             0           0       2        7        0        0   0   1
##   Education Income
## 1         6      8
## 2         6      8
## 3         6      8
## 4         6      8
## 5         5      8
## 6         4      7
print("number of unique diabetes/non-diabetes occurrences")
## [1] "number of unique diabetes/non-diabetes occurrences"
table(data2$Diabetes_binary)
## 
##     0     1 
## 35346 35346
print(paste("dimension of the 3rd dataset: " ,dim(data3)))
## [1] "dimension of the 3rd dataset:  253680"
## [2] "dimension of the 3rd dataset:  22"
print("first rows of the 3rd dataset")
## [1] "first rows of the 3rd dataset"
head(data3)
##   Diabetes_012 HighBP HighChol CholCheck BMI Smoker Stroke HeartDiseaseorAttack
## 1            0      1        1         1  40      1      0                    0
## 2            0      0        0         0  25      1      0                    0
## 3            0      1        1         1  28      0      0                    0
## 4            0      1        0         1  27      0      0                    0
## 5            0      1        1         1  24      0      0                    0
## 6            0      1        1         1  25      1      0                    0
##   PhysActivity Fruits Veggies HvyAlcoholConsump AnyHealthcare NoDocbcCost
## 1            0      0       1                 0             1           0
## 2            1      0       0                 0             0           1
## 3            0      1       0                 0             1           1
## 4            1      1       1                 0             1           0
## 5            1      1       1                 0             1           0
## 6            1      1       1                 0             1           0
##   GenHlth MentHlth PhysHlth DiffWalk Sex Age Education Income
## 1       5       18       15        1   0   9         4      3
## 2       3        0        0        0   0   7         6      1
## 3       5       30       30        1   0   9         4      8
## 4       2        0        0        0   0  11         3      6
## 5       2        3        0        0   0  11         5      4
## 6       2        0        2        0   1  10         6      8
print("number of occurrences per diabetes class")
## [1] "number of occurrences per diabetes class"
# data3 uses Diabetes_012 as its response column
table(data3$Diabetes_012)

Now we check whether there are any duplicate rows.

duplicates1 <- sum(duplicated(data1))
p1 <- duplicates1 / nrow(data1)
duplicates2 <- sum(duplicated(data2))
p2 <- duplicates2 / nrow(data2)
duplicates3 <- sum(duplicated(data3))
p3 <- duplicates3 / nrow(data3)
print(paste("Number of duplicate rows in the first dataset: ", duplicates1, "which is ", p1*100,"%"))
## [1] "Number of duplicate rows in the first dataset:  24206 which is  9.54194260485651 %"
print(paste("Number of duplicate rows in the second dataset: ", duplicates2, "which is ", p2*100,"%"))
## [1] "Number of duplicate rows in the second dataset:  1635 which is  2.3128501103378 %"
print(paste("Number of duplicate rows in the third dataset: ", duplicates3, "which is ", p3*100,"%"))
## [1] "Number of duplicate rows in the third dataset:  23899 which is  9.42092399873857 %"

We can see that a small but noticeable portion of the rows are duplicated. This is not a problem here: the data comes from surveying people about their conditions, and with a limited set of discrete answer combinations, identical rows arise naturally.
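If we did want to drop the duplicates, base R's `duplicated()` makes it a one-liner. A minimal sketch on a toy frame (we keep the duplicates in the actual analysis):

```r
# toy frame with one exact duplicate row
df <- data.frame(a = c(1, 1, 2), b = c("x", "x", "y"))

# keep only the first occurrence of each row
df_unique <- df[!duplicated(df), ]
nrow(df_unique)  # 2 distinct rows remain
```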

Now we check whether there are any missing (NA) values in the data.

null1 <- sum(is.na(data1))
null2 <- sum(is.na(data2))
null3 <- sum(is.na(data3))
print(paste("Number of NA values in the first dataset: ", null1))
## [1] "Number of NA values in the first dataset:  0"
print(paste("Number of NA values in the second dataset: ", null2))
## [1] "Number of NA values in the second dataset:  0"
print(paste("Number of NA values in the third dataset: ", null3))
## [1] "Number of NA values in the third dataset:  0"

As we can see there are no NA values, so we can continue without worrying about missing data.

Now let's see the number of distinct values in each column of our data.

distinct_values <- function(data_frame) {
  for (col in colnames(data_frame)) {
    distinct_count <- length(unique(data_frame[[col]]))
    print(paste("Column:", col, "number of distinct values:",
                distinct_count))
  }
}
print("-Dataset1-")
## [1] "-Dataset1-"
distinct_values(data1)
## [1] "Column: Diabetes_binary number of distinct values: 2"
## [1] "Column: HighBP number of distinct values: 2"
## [1] "Column: HighChol number of distinct values: 2"
## [1] "Column: CholCheck number of distinct values: 2"
## [1] "Column: BMI number of distinct values: 84"
## [1] "Column: Smoker number of distinct values: 2"
## [1] "Column: Stroke number of distinct values: 2"
## [1] "Column: HeartDiseaseorAttack number of distinct values: 2"
## [1] "Column: PhysActivity number of distinct values: 2"
## [1] "Column: Fruits number of distinct values: 2"
## [1] "Column: Veggies number of distinct values: 2"
## [1] "Column: HvyAlcoholConsump number of distinct values: 2"
## [1] "Column: AnyHealthcare number of distinct values: 2"
## [1] "Column: NoDocbcCost number of distinct values: 2"
## [1] "Column: GenHlth number of distinct values: 5"
## [1] "Column: MentHlth number of distinct values: 31"
## [1] "Column: PhysHlth number of distinct values: 31"
## [1] "Column: DiffWalk number of distinct values: 2"
## [1] "Column: Sex number of distinct values: 2"
## [1] "Column: Age number of distinct values: 13"
## [1] "Column: Education number of distinct values: 6"
## [1] "Column: Income number of distinct values: 8"
print("-Dataset2-")
## [1] "-Dataset2-"
distinct_values(data2)
## [1] "Column: Diabetes_binary number of distinct values: 2"
## [1] "Column: HighBP number of distinct values: 2"
## [1] "Column: HighChol number of distinct values: 2"
## [1] "Column: CholCheck number of distinct values: 2"
## [1] "Column: BMI number of distinct values: 80"
## [1] "Column: Smoker number of distinct values: 2"
## [1] "Column: Stroke number of distinct values: 2"
## [1] "Column: HeartDiseaseorAttack number of distinct values: 2"
## [1] "Column: PhysActivity number of distinct values: 2"
## [1] "Column: Fruits number of distinct values: 2"
## [1] "Column: Veggies number of distinct values: 2"
## [1] "Column: HvyAlcoholConsump number of distinct values: 2"
## [1] "Column: AnyHealthcare number of distinct values: 2"
## [1] "Column: NoDocbcCost number of distinct values: 2"
## [1] "Column: GenHlth number of distinct values: 5"
## [1] "Column: MentHlth number of distinct values: 31"
## [1] "Column: PhysHlth number of distinct values: 31"
## [1] "Column: DiffWalk number of distinct values: 2"
## [1] "Column: Sex number of distinct values: 2"
## [1] "Column: Age number of distinct values: 13"
## [1] "Column: Education number of distinct values: 6"
## [1] "Column: Income number of distinct values: 8"
print("-Dataset3-")
## [1] "-Dataset3-"
distinct_values(data3)
## [1] "Column: Diabetes_012 number of distinct values: 3"
## [1] "Column: HighBP number of distinct values: 2"
## [1] "Column: HighChol number of distinct values: 2"
## [1] "Column: CholCheck number of distinct values: 2"
## [1] "Column: BMI number of distinct values: 84"
## [1] "Column: Smoker number of distinct values: 2"
## [1] "Column: Stroke number of distinct values: 2"
## [1] "Column: HeartDiseaseorAttack number of distinct values: 2"
## [1] "Column: PhysActivity number of distinct values: 2"
## [1] "Column: Fruits number of distinct values: 2"
## [1] "Column: Veggies number of distinct values: 2"
## [1] "Column: HvyAlcoholConsump number of distinct values: 2"
## [1] "Column: AnyHealthcare number of distinct values: 2"
## [1] "Column: NoDocbcCost number of distinct values: 2"
## [1] "Column: GenHlth number of distinct values: 5"
## [1] "Column: MentHlth number of distinct values: 31"
## [1] "Column: PhysHlth number of distinct values: 31"
## [1] "Column: DiffWalk number of distinct values: 2"
## [1] "Column: Sex number of distinct values: 2"
## [1] "Column: Age number of distinct values: 13"
## [1] "Column: Education number of distinct values: 6"
## [1] "Column: Income number of distinct values: 8"

As the final step of our pre-processing stage we take a look at the summary of all three datasets.

summary(data1)
##  Diabetes_binary      HighBP         HighChol        CholCheck     
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:1.0000  
##  Median :0.0000   Median :0.000   Median :0.0000   Median :1.0000  
##  Mean   :0.1393   Mean   :0.429   Mean   :0.4241   Mean   :0.9627  
##  3rd Qu.:0.0000   3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.000   Max.   :1.0000   Max.   :1.0000  
##       BMI            Smoker           Stroke        HeartDiseaseorAttack
##  Min.   :12.00   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000     
##  1st Qu.:24.00   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000     
##  Median :27.00   Median :0.0000   Median :0.00000   Median :0.00000     
##  Mean   :28.38   Mean   :0.4432   Mean   :0.04057   Mean   :0.09419     
##  3rd Qu.:31.00   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.00000     
##  Max.   :98.00   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000     
##   PhysActivity        Fruits          Veggies       HvyAlcoholConsump
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   
##  1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   
##  Median :1.0000   Median :1.0000   Median :1.0000   Median :0.0000   
##  Mean   :0.7565   Mean   :0.6343   Mean   :0.8114   Mean   :0.0562   
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   
##  AnyHealthcare     NoDocbcCost         GenHlth         MentHlth     
##  Min.   :0.0000   Min.   :0.00000   Min.   :1.000   Min.   : 0.000  
##  1st Qu.:1.0000   1st Qu.:0.00000   1st Qu.:2.000   1st Qu.: 0.000  
##  Median :1.0000   Median :0.00000   Median :2.000   Median : 0.000  
##  Mean   :0.9511   Mean   :0.08418   Mean   :2.511   Mean   : 3.185  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:3.000   3rd Qu.: 2.000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :5.000   Max.   :30.000  
##     PhysHlth         DiffWalk           Sex              Age        
##  Min.   : 0.000   Min.   :0.0000   Min.   :0.0000   Min.   : 1.000  
##  1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 6.000  
##  Median : 0.000   Median :0.0000   Median :0.0000   Median : 8.000  
##  Mean   : 4.242   Mean   :0.1682   Mean   :0.4403   Mean   : 8.032  
##  3rd Qu.: 3.000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:10.000  
##  Max.   :30.000   Max.   :1.0000   Max.   :1.0000   Max.   :13.000  
##    Education        Income     
##  Min.   :1.00   Min.   :1.000  
##  1st Qu.:4.00   1st Qu.:5.000  
##  Median :5.00   Median :7.000  
##  Mean   :5.05   Mean   :6.054  
##  3rd Qu.:6.00   3rd Qu.:8.000  
##  Max.   :6.00   Max.   :8.000
summary(data2)
##  Diabetes_binary     HighBP          HighChol        CholCheck     
##  Min.   :0.0     Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0     1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.0000  
##  Median :0.5     Median :1.0000   Median :1.0000   Median :1.0000  
##  Mean   :0.5     Mean   :0.5635   Mean   :0.5257   Mean   :0.9753  
##  3rd Qu.:1.0     3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0     Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##       BMI            Smoker           Stroke        HeartDiseaseorAttack
##  Min.   :12.00   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000      
##  1st Qu.:25.00   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000      
##  Median :29.00   Median :0.0000   Median :0.00000   Median :0.0000      
##  Mean   :29.86   Mean   :0.4753   Mean   :0.06217   Mean   :0.1478      
##  3rd Qu.:33.00   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.0000      
##  Max.   :98.00   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000      
##   PhysActivity       Fruits          Veggies       HvyAlcoholConsump
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.00000  
##  Median :1.000   Median :1.0000   Median :1.0000   Median :0.00000  
##  Mean   :0.703   Mean   :0.6118   Mean   :0.7888   Mean   :0.04272  
##  3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##  AnyHealthcare    NoDocbcCost         GenHlth         MentHlth     
##  Min.   :0.000   Min.   :0.00000   Min.   :1.000   Min.   : 0.000  
##  1st Qu.:1.000   1st Qu.:0.00000   1st Qu.:2.000   1st Qu.: 0.000  
##  Median :1.000   Median :0.00000   Median :3.000   Median : 0.000  
##  Mean   :0.955   Mean   :0.09391   Mean   :2.837   Mean   : 3.752  
##  3rd Qu.:1.000   3rd Qu.:0.00000   3rd Qu.:4.000   3rd Qu.: 2.000  
##  Max.   :1.000   Max.   :1.00000   Max.   :5.000   Max.   :30.000  
##     PhysHlth        DiffWalk           Sex             Age        
##  Min.   : 0.00   Min.   :0.0000   Min.   :0.000   Min.   : 1.000  
##  1st Qu.: 0.00   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.: 7.000  
##  Median : 0.00   Median :0.0000   Median :0.000   Median : 9.000  
##  Mean   : 5.81   Mean   :0.2527   Mean   :0.457   Mean   : 8.584  
##  3rd Qu.: 6.00   3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:11.000  
##  Max.   :30.00   Max.   :1.0000   Max.   :1.000   Max.   :13.000  
##    Education         Income     
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:4.000  
##  Median :5.000   Median :6.000  
##  Mean   :4.921   Mean   :5.698  
##  3rd Qu.:6.000   3rd Qu.:8.000  
##  Max.   :6.000   Max.   :8.000
summary(data3)
##   Diabetes_012        HighBP         HighChol        CholCheck     
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:1.0000  
##  Median :0.0000   Median :0.000   Median :0.0000   Median :1.0000  
##  Mean   :0.2969   Mean   :0.429   Mean   :0.4241   Mean   :0.9627  
##  3rd Qu.:0.0000   3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :2.0000   Max.   :1.000   Max.   :1.0000   Max.   :1.0000  
##       BMI            Smoker           Stroke        HeartDiseaseorAttack
##  Min.   :12.00   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000     
##  1st Qu.:24.00   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000     
##  Median :27.00   Median :0.0000   Median :0.00000   Median :0.00000     
##  Mean   :28.38   Mean   :0.4432   Mean   :0.04057   Mean   :0.09419     
##  3rd Qu.:31.00   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.00000     
##  Max.   :98.00   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000     
##   PhysActivity        Fruits          Veggies       HvyAlcoholConsump
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   
##  1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   
##  Median :1.0000   Median :1.0000   Median :1.0000   Median :0.0000   
##  Mean   :0.7565   Mean   :0.6343   Mean   :0.8114   Mean   :0.0562   
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   
##  AnyHealthcare     NoDocbcCost         GenHlth         MentHlth     
##  Min.   :0.0000   Min.   :0.00000   Min.   :1.000   Min.   : 0.000  
##  1st Qu.:1.0000   1st Qu.:0.00000   1st Qu.:2.000   1st Qu.: 0.000  
##  Median :1.0000   Median :0.00000   Median :2.000   Median : 0.000  
##  Mean   :0.9511   Mean   :0.08418   Mean   :2.511   Mean   : 3.185  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:3.000   3rd Qu.: 2.000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :5.000   Max.   :30.000  
##     PhysHlth         DiffWalk           Sex              Age        
##  Min.   : 0.000   Min.   :0.0000   Min.   :0.0000   Min.   : 1.000  
##  1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 6.000  
##  Median : 0.000   Median :0.0000   Median :0.0000   Median : 8.000  
##  Mean   : 4.242   Mean   :0.1682   Mean   :0.4403   Mean   : 8.032  
##  3rd Qu.: 3.000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:10.000  
##  Max.   :30.000   Max.   :1.0000   Max.   :1.0000   Max.   :13.000  
##    Education        Income     
##  Min.   :1.00   Min.   :1.000  
##  1st Qu.:4.00   1st Qu.:5.000  
##  Median :5.00   Median :7.000  
##  Mean   :5.05   Mean   :6.054  
##  3rd Qu.:6.00   3rd Qu.:8.000  
##  Max.   :6.00   Max.   :8.000

Before going further, we make a small change to our data frames: we convert the response columns to factors, which is required for the plots below.

data1$Diabetes_binary <- as.factor(data1$Diabetes_binary)
data2$Diabetes_binary <- as.factor(data2$Diabetes_binary)
data3$Diabetes_012 <- as.factor(data3$Diabetes_012)

Now we take a deeper look at our data and do some exploratory analysis.

First, let's look at the distribution of the data for each variable. We use violin plots for this purpose.

par(mfrow = c(1, 3))
for (col_ in colnames(data1)){
  if (col_ == "Diabetes_binary"){
    next
  }
  
  vioplot(data1[[col_]], col = "yellow", border = "black",
  horizontal = FALSE, xlab = "dataset1", ylab = "Values",
  main = col_)
  
  vioplot(data2[[col_]], col = "orange", border = "black",
  horizontal = FALSE, xlab = "dataset2", ylab = "Values",
  main = col_)
  
  vioplot(data3[[col_]], col = "red", border = "black",
  horizontal = FALSE, xlab = "dataset3", ylab = "Values",
  main = col_)
}

Now that we have some idea about the distribution of each variable, let's compare the distribution of each variable separated by response. We do this to see whether there is a meaningful difference between the distributions; this time we use KDE plots.

for (i in 2:22){
  # columns 2:22 share the same names across all three datasets
  col_name <- colnames(data1)[i]
  
  p1 <- ggplot(data1, aes(x = .data[[col_name]], fill = Diabetes_binary)) +
    geom_density(alpha = 0.4) +
    ggtitle("from data1") +
    labs(x = col_name)
  
  p2 <- ggplot(data2, aes(x = .data[[col_name]], fill = Diabetes_binary)) +
    geom_density(alpha = 0.4) +
    ggtitle("from data2") +
    labs(x = col_name)
  
  p3 <- ggplot(data3, aes(x = .data[[col_name]], fill = Diabetes_012)) +
    geom_density(alpha = 0.4) +
    ggtitle("from data3") +
    labs(x = col_name)
  
  print(p1)
  print(p2)
  print(p3)
}

As we can see, there are obvious and meaningful differences in the distributions of some of the parameters. However, this is only visualization; we will perform a proper feature-importance analysis later in order to understand which parameters actually matter.

As the final step of our EDA, we perform a multivariate analysis to see the correlations between the parameters.

# we undo the factor conversion made above
data1$Diabetes_binary <- as.numeric(data1$Diabetes_binary) - 1
data2$Diabetes_binary <- as.numeric(data2$Diabetes_binary) - 1
data3$Diabetes_012 <- as.numeric(data3$Diabetes_012) - 1
# these correlation plots are hard to read at this size,
# so their images are also included below for a better view
corPlot(data1, cex = 0.5)

corPlot(data2, cex = 0.5)

corPlot(data3, cex = 0.5)

Our EDA stage is now done. We have a good understanding of the data and some information about the different variables; here we briefly mention some of the key points.

First we drew violin plots for all of our datasets. The distributions across the three datasets are almost the same; only two parameters (GenHlth, Age) show very minor differences, which should not affect the model output much.

Then we have the KDE plots, showing the distribution of each variable separated by response value. We did this to understand which variables differ meaningfully between the classes and could therefore be useful model parameters. Different patterns appear: if the densities look alike, that variable does not affect the outcome much (for instance Education), and vice versa.

Next come the correlation plots. They show that the variables HighBP, HighChol, BMI, GenHlth, DiffWalk, Age and HeartDiseaseorAttack have the most significant correlation with diabetes. We can also see that GenHlth, DiffWalk, PhysHlth, Age, Education and Income have colorful rows, i.e. a noticeable correlation with most of the other variables.
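One way to make such a claim precise is to rank every predictor by the absolute value of its correlation with the response. A minimal sketch on toy data (the column names are illustrative; the same call pattern works on `data2`):

```r
# rank predictors by |correlation with the response|
set.seed(1)
toy <- data.frame(y = rbinom(100, 1, 0.5))
toy$x1 <- toy$y + rnorm(100)       # built to be correlated with y
toy$x2 <- rnorm(100)               # pure noise

cors <- cor(toy)[, "y"]            # correlation of every column with y
cors <- cors[names(cors) != "y"]   # drop the response itself
sort(abs(cors), decreasing = TRUE) # x1 should rank above x2
```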

Finally we have the pair plots. The code used to produce them is included below but commented out, because the computation is very time-consuming and resource-demanding (it took around 3 hours to produce all 3 of them!); the rendered images are included instead.

# plot1 <- ggpairs( data1 )
# ggsave(filename = "/content/plot1.png", plot = plot1, width = 100, height = 50, limitsize = FALSE)
# print("done!")
# plot2 <- ggpairs( data2 )
# ggsave(filename = "/content/plot2.png", plot = plot2, width = 100, height = 50, limitsize = FALSE)
# print("done!")
# plot3 <- ggpairs( data3 )
# ggsave(filename = "/content/plot3.png", plot = plot3, width = 100, height = 50, limitsize = FALSE)
# print("done!")

Here are the plots, respectively:

knitr::include_graphics('./plots/plot1_1.png')

knitr::include_graphics('./plots/plot2_1.png')

knitr::include_graphics('./plots/plot3_1.png')

First, let's explain what is going on in this massive figure. The boxes along the diagonal display the density plot of each variable; the boxes in the lower-left triangle display the scatterplot for each pair of variables; and the boxes in the upper-right triangle display the Pearson correlation coefficient for each pair. The Pearson correlation measures the linear relationship between two variables and ranges from -1 to 1, where -1 signifies a total negative linear correlation, 0 signifies no linear correlation, and +1 signifies a total positive correlation. The figure packs in a lot of information, but the pairwise correlations together with the scatterplots give good insight into the relationships between the variables.
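As a concrete check on that definition, Pearson's r can be computed by hand as the covariance divided by the product of the standard deviations, which matches R's built-in `cor()`:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Pearson's r = cov(x, y) / (sd(x) * sd(y))
r_manual  <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)

all.equal(r_manual, r_builtin)  # TRUE
```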

One more point before we fit the models. As we saw above, the variable distributions in the 3 datasets are almost the same and the differences can be ignored. Before training, however, the classes should be balanced: the number of data points per target value should be roughly equal, or else the model will be biased toward the majority class. As we saw at the beginning of this notebook, the response variable is extremely unbalanced in the data1 and data3 data frames, where almost 90% of the rows refer to non-diabetic people, so models trained on them would be heavily biased and not very useful. Considering all of that, we will train our models on the balanced, 50/50 dataset.
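For reference, the same balance could also be produced from an unbalanced frame by downsampling the majority class. A sketch on a toy frame (the same idea would apply to `data1`):

```r
set.seed(42)
# toy imbalanced frame: 90 negatives, 10 positives
toy <- data.frame(y = c(rep(0, 90), rep(1, 10)), x = rnorm(100))

n_min  <- min(table(toy$y))                 # size of the minority class
pos    <- toy[toy$y == 1, ]
neg    <- toy[toy$y == 0, ]
neg_ds <- neg[sample(nrow(neg), n_min), ]   # downsample the majority class

balanced <- rbind(pos, neg_ds)
table(balanced$y)                           # 10 of each class
```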

First we set a seed so that the results are reproducible.

set.seed(12)

The first model that we are going to use is logistic regression.

# Splitting the dataset: sample.split expects the response vector,
# so that the split is stratified on the class labels
split <- sample.split(data2$Diabetes_binary, SplitRatio = 0.8)

train_set <- subset(data2, split == TRUE)
test_set <- subset(data2, split == FALSE)

# Training model
logistic_model <- glm(Diabetes_binary ~ ., 
                      data = train_set, 
                      family = "binomial")
logistic_model
## 
## Call:  glm(formula = Diabetes_binary ~ ., family = "binomial", data = train_set)
## 
## Coefficients:
##          (Intercept)                HighBP              HighChol  
##           -7.0235924             0.7425393             0.5834863  
##            CholCheck                   BMI                Smoker  
##            1.4748916             0.0766234             0.0001281  
##               Stroke  HeartDiseaseorAttack          PhysActivity  
##            0.1552362             0.2302216            -0.0506889  
##               Fruits               Veggies     HvyAlcoholConsump  
##           -0.0106047            -0.0763388            -0.7520092  
##        AnyHealthcare           NoDocbcCost               GenHlth  
##            0.0834139             0.0029193             0.5865147  
##             MentHlth              PhysHlth              DiffWalk  
##           -0.0049406            -0.0076018             0.1055333  
##                  Sex                   Age             Education  
##            0.2613702             0.1510941            -0.0359151  
##               Income  
##           -0.0571832  
## 
## Degrees of Freedom: 54624 Total (i.e. Null);  54603 Residual
## Null Deviance:       75730 
## Residual Deviance: 55900     AIC: 55940
summary(logistic_model)
## 
## Call:
## glm(formula = Diabetes_binary ~ ., family = "binomial", data = train_set)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -7.0235924  0.1443021 -48.673  < 2e-16 ***
## HighBP                0.7425393  0.0224509  33.074  < 2e-16 ***
## HighChol              0.5834863  0.0214771  27.168  < 2e-16 ***
## CholCheck             1.4748916  0.0956703  15.416  < 2e-16 ***
## BMI                   0.0766234  0.0017914  42.773  < 2e-16 ***
## Smoker                0.0001281  0.0214904   0.006 0.995243    
## Stroke                0.1552362  0.0463371   3.350 0.000808 ***
## HeartDiseaseorAttack  0.2302216  0.0322714   7.134 9.76e-13 ***
## PhysActivity         -0.0506889  0.0242623  -2.089 0.036689 *  
## Fruits               -0.0106047  0.0223068  -0.475 0.634500    
## Veggies              -0.0763388  0.0265397  -2.876 0.004022 ** 
## HvyAlcoholConsump    -0.7520092  0.0555929 -13.527  < 2e-16 ***
## AnyHealthcare         0.0834139  0.0539099   1.547 0.121795    
## NoDocbcCost           0.0029193  0.0388905   0.075 0.940163    
## GenHlth               0.5865147  0.0130431  44.967  < 2e-16 ***
## MentHlth             -0.0049406  0.0014585  -3.387 0.000706 ***
## PhysHlth             -0.0076018  0.0013608  -5.586 2.32e-08 ***
## DiffWalk              0.1055333  0.0294442   3.584 0.000338 ***
## Sex                   0.2613702  0.0218103  11.984  < 2e-16 ***
## Age                   0.1510941  0.0044589  33.886  < 2e-16 ***
## Education            -0.0359151  0.0116412  -3.085 0.002034 ** 
## Income               -0.0571832  0.0058939  -9.702  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 75726  on 54624  degrees of freedom
## Residual deviance: 55900  on 54603  degrees of freedom
## AIC: 55944
## 
## Number of Fisher Scoring iterations: 5
# evaluating performance on test
predict_reg <- predict(logistic_model, 
                       test_set, type = "response")

# applying the 0.5 threshold
predict_reg <- ifelse(predict_reg > 0.5, 1, 0)

# Evaluating model accuracy
# using confusion matrix
table(test_set$Diabetes_binary, predict_reg)
##    predict_reg
##        0    1
##   0 5829 2204
##   1 1899 6135
missing_classerr <- mean(predict_reg != test_set$Diabetes_binary)
print(paste('Accuracy =', 1 - missing_classerr))
## [1] "Accuracy = 0.74463185411091"
# ROC-AUC Curve
ROCPred <- prediction(predict_reg, test_set$Diabetes_binary) 
ROCPer <- performance(ROCPred, measure = "tpr", 
                             x.measure = "fpr")
   
auc <- performance(ROCPred, measure = "auc")
auc <- auc@y.values[[1]]
   
# Plotting curve
plot(ROCPer, colorize = TRUE, 
     print.cutoffs.at = seq(0.1, by = 0.1), # the threshold
     main = "ROC CURVE")
abline(a = 0, b = 1)

auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)

As we can see, the logistic regression model reaches roughly 74.5% accuracy, which is quite good. Note that the reported AUC of 0.7446 was computed from the thresholded 0/1 predictions rather than the raw probabilities, so it essentially mirrors the accuracy; evaluating the probabilities directly would give a more informative ROC curve.
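For reference, the ROC/AUC can be computed from the raw probabilities before thresholding. A minimal sketch, using synthetic scores as a stand-in for the model's predicted probabilities (on the real data the scores would come from `predict(logistic_model, test_set, type = "response")`):

```r
library(ROCR)

# Synthetic stand-in: noisy but informative probability scores for
# simulated 0/1 labels, mimicking the logistic model's output.
set.seed(42)
labels <- rbinom(2000, 1, 0.5)
scores <- plogis(2 * labels - 1 + rnorm(2000))

# ROC and AUC from the continuous scores, not thresholded labels
roc_pred <- prediction(scores, labels)
roc_perf <- performance(roc_pred, measure = "tpr", x.measure = "fpr")
auc <- performance(roc_pred, measure = "auc")@y.values[[1]]

plot(roc_perf, main = "ROC from probabilities")
abline(a = 0, b = 1)
```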

Now let's try other models to see whether we can get higher accuracy, starting with a random forest.

# first we will convert the target value to be categorical
# so that we can implement the random forest algorithm

data2_factor <- data.frame(data2)
data2_factor$Diabetes_binary <- as.factor(data2_factor$Diabetes_binary)

# note: sample.split() expects the label vector, not the whole data
# frame, to produce a stratified 80/20 split over the rows
split <- sample.split(data2_factor$Diabetes_binary, SplitRatio = 0.8)

train_set <- subset(data2_factor, split == TRUE)
test_set <- subset(data2_factor, split == FALSE)

classifier_RF = randomForest(Diabetes_binary ~ ., data = train_set,
                             mtry = 5, importance = TRUE)

classifier_RF
## 
## Call:
##  randomForest(formula = Diabetes_binary ~ ., data = train_set,      mtry = 5, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 25.97%
## Confusion matrix:
##       0     1 class.error
## 0 19208  8105   0.2967451
## 1  6083 21231   0.2227063
y_pred = predict(classifier_RF, newdata = test_set)

confusion_mtx = table(test_set$Diabetes_binary, y_pred)
confusion_mtx
##    y_pred
##        0    1
##   0 5739 2294
##   1 1768 6264

now let’s see the variable importance according to random forest

importance(classifier_RF)
##                               0           1 MeanDecreaseAccuracy
## HighBP                93.760982 104.9724844           136.331536
## HighChol              66.840325  63.3012480            89.284448
## CholCheck             15.363266  57.5413897            57.591267
## BMI                  106.951156 109.0162907           154.687018
## Smoker                14.808492  -0.7524138             9.664933
## Stroke                37.469172  -2.0665334            29.168198
## HeartDiseaseorAttack  60.805536   4.0334527            54.646623
## PhysActivity          16.913400   0.6209237            12.983154
## Fruits                 9.790585  -4.0241985             3.546368
## Veggies                9.717093  -1.3408821             5.769547
## HvyAlcoholConsump     16.231622  41.7585824            42.788497
## AnyHealthcare         10.007715  -0.9753982             6.694762
## NoDocbcCost           15.603537  -1.7020894             9.453623
## GenHlth              164.414913  82.7740146           182.587514
## MentHlth              10.471535   2.9273439            10.383282
## PhysHlth              48.046367  -0.3061497            32.838443
## DiffWalk              47.331534  11.9522234            42.723845
## Sex                   16.716232  16.3315083            24.026195
## Age                   94.071732  83.2519704           126.977266
## Education             31.541297   1.0929207            24.873798
## Income                55.335634  14.2727062            51.638606
##                      MeanDecreaseGini
## HighBP                      2241.1299
## HighChol                    1005.7548
## CholCheck                    152.3054
## BMI                         3727.9234
## Smoker                       655.8481
## Stroke                       252.3914
## HeartDiseaseorAttack         462.9863
## PhysActivity                 574.8474
## Fruits                       648.1959
## Veggies                      525.9186
## HvyAlcoholConsump            239.6687
## AnyHealthcare                192.8662
## NoDocbcCost                  307.1197
## GenHlth                     2918.6002
## MentHlth                    1254.8706
## PhysHlth                    1606.1221
## DiffWalk                     633.9565
## Sex                          612.8367
## Age                         2772.2541
## Education                   1297.5732
## Income                      1864.6530
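The table above is easier to digest as a plot; randomForest ships `varImpPlot()` for exactly this. A small sketch on a built-in dataset (with the fitted model above, `varImpPlot(classifier_RF)` would be the one-liner):

```r
library(randomForest)

# Illustrative: a small forest on iris stands in for classifier_RF.
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

# Sorted dot charts of MeanDecreaseAccuracy and MeanDecreaseGini
varImpPlot(rf, sort = TRUE, main = "Variable importance")

imp <- importance(rf)  # the same matrix that is printed above
```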

Now we will plot the random forest predictions against the true labels and compute the accuracy:

plot (y_pred, test_set$Diabetes_binary,
      xlab = "prediction", ylab = "true value")

missing_classerr <- mean(y_pred != test_set$Diabetes_binary)
print(paste('Accuracy =', 1 - missing_classerr))
## [1] "Accuracy = 0.747152194211018"

We see that the accuracy is approximately 74.7%, so the random forest did not improve much on logistic regression.
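One knob we left fixed is `mtry = 5`; randomForest's `tuneRF()` searches over mtry guided by the OOB error and might recover a little accuracy. A sketch on a built-in dataset (on our data the call would take `train_set[, -1]` and `train_set$Diabetes_binary` instead):

```r
library(randomForest)

# tuneRF steps mtry up and down from its default, keeping a value
# when it improves the OOB error by at least `improve`.
set.seed(1)
tuned <- tuneRF(iris[, -5], iris$Species, ntreeTry = 100,
                stepFactor = 1.5, improve = 0.01, trace = FALSE)

# row of the returned matrix with the lowest OOB error
best_mtry <- tuned[which.min(tuned[, "OOBError"]), "mtry"]
```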

Now let's try another model: XGBoost. Let's see if there is any improvement.

# again splitting on the label vector for a proper 80/20 row split
split <- sample.split(data2$Diabetes_binary, SplitRatio = 0.8)

train_set <- subset(data2, split == TRUE)
test_set <- subset(data2, split == FALSE)

# separating predictor and response variables
train_x = data.matrix(train_set[, -1])
train_y = train_set[, 1]

test_x = data.matrix(test_set[, -1])
test_y = test_set[, 1]

# define final training and testing sets
xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test = xgb.DMatrix(data = test_x, label = test_y)

watchlist = list(train=xgb_train, test=xgb_test)

# fit model and display training and testing data at each round
model = xgb.train(data = xgb_train, max.depth = 3, watchlist=watchlist,
                  nrounds = 200)
## [1]  train-rmse:0.466636 test-rmse:0.467304 
## [2]  train-rmse:0.448256 test-rmse:0.449669 
## [3]  train-rmse:0.434992 test-rmse:0.436970 
## [4]  train-rmse:0.428125 test-rmse:0.430480 
## [5]  train-rmse:0.422700 test-rmse:0.425639 
## [6]  train-rmse:0.418885 test-rmse:0.421995 
## [7]  train-rmse:0.416398 test-rmse:0.419905 
## [8]  train-rmse:0.414664 test-rmse:0.418240 
## [9]  train-rmse:0.413353 test-rmse:0.417111 
## [10] train-rmse:0.412232 test-rmse:0.415976 
## [11] train-rmse:0.411208 test-rmse:0.415075 
## [12] train-rmse:0.410543 test-rmse:0.414422 
## [13] train-rmse:0.410039 test-rmse:0.413911 
## [14] train-rmse:0.409656 test-rmse:0.413615 
## [15] train-rmse:0.409276 test-rmse:0.413272 
## [16] train-rmse:0.409012 test-rmse:0.413062 
## [17] train-rmse:0.408779 test-rmse:0.412880 
## [18] train-rmse:0.408528 test-rmse:0.412733 
## [19] train-rmse:0.408338 test-rmse:0.412579 
## [20] train-rmse:0.408179 test-rmse:0.412502 
## [21] train-rmse:0.408005 test-rmse:0.412460 
## [22] train-rmse:0.407848 test-rmse:0.412406 
## [23] train-rmse:0.407731 test-rmse:0.412348 
## [24] train-rmse:0.407633 test-rmse:0.412281 
## [25] train-rmse:0.407518 test-rmse:0.412259 
## [26] train-rmse:0.407427 test-rmse:0.412193 
## [27] train-rmse:0.407322 test-rmse:0.412139 
## [28] train-rmse:0.407244 test-rmse:0.412084 
## [29] train-rmse:0.407168 test-rmse:0.412086 
## [30] train-rmse:0.407077 test-rmse:0.412051 
## [31] train-rmse:0.407024 test-rmse:0.412015 
## [32] train-rmse:0.406853 test-rmse:0.411884 
## [33] train-rmse:0.406778 test-rmse:0.411857 
## [34] train-rmse:0.406718 test-rmse:0.411858 
## [35] train-rmse:0.406643 test-rmse:0.411829 
## [36] train-rmse:0.406559 test-rmse:0.411779 
## [37] train-rmse:0.406470 test-rmse:0.411765 
## [38] train-rmse:0.406438 test-rmse:0.411765 
## [39] train-rmse:0.406377 test-rmse:0.411771 
## [40] train-rmse:0.406338 test-rmse:0.411733 
## [41] train-rmse:0.406230 test-rmse:0.411708 
## [42] train-rmse:0.406183 test-rmse:0.411686 
## [43] train-rmse:0.406126 test-rmse:0.411691 
## [44] train-rmse:0.406034 test-rmse:0.411630 
## [45] train-rmse:0.405979 test-rmse:0.411629 
## [46] train-rmse:0.405930 test-rmse:0.411586 
## [47] train-rmse:0.405905 test-rmse:0.411573 
## [48] train-rmse:0.405859 test-rmse:0.411609 
## [49] train-rmse:0.405824 test-rmse:0.411606 
## [50] train-rmse:0.405786 test-rmse:0.411596 
## [51] train-rmse:0.405751 test-rmse:0.411574 
## [52] train-rmse:0.405699 test-rmse:0.411532 
## [53] train-rmse:0.405668 test-rmse:0.411528 
## [54] train-rmse:0.405603 test-rmse:0.411531 
## [55] train-rmse:0.405537 test-rmse:0.411517 
## [56] train-rmse:0.405486 test-rmse:0.411473 
## [57] train-rmse:0.405442 test-rmse:0.411454 
## [58] train-rmse:0.405402 test-rmse:0.411421 
## [59] train-rmse:0.405360 test-rmse:0.411430 
## [60] train-rmse:0.405338 test-rmse:0.411406 
## [61] train-rmse:0.405296 test-rmse:0.411366 
## [62] train-rmse:0.405262 test-rmse:0.411398 
## [63] train-rmse:0.405212 test-rmse:0.411404 
## [64] train-rmse:0.405191 test-rmse:0.411412 
## [65] train-rmse:0.405160 test-rmse:0.411390 
## [66] train-rmse:0.405131 test-rmse:0.411378 
## [67] train-rmse:0.405084 test-rmse:0.411302 
## [68] train-rmse:0.405045 test-rmse:0.411302 
## [69] train-rmse:0.405017 test-rmse:0.411270 
## [70] train-rmse:0.404992 test-rmse:0.411257 
## [71] train-rmse:0.404972 test-rmse:0.411273 
## [72] train-rmse:0.404904 test-rmse:0.411252 
## [73] train-rmse:0.404870 test-rmse:0.411224 
## [74] train-rmse:0.404844 test-rmse:0.411248 
## [75] train-rmse:0.404791 test-rmse:0.411240 
## [76] train-rmse:0.404767 test-rmse:0.411244 
## [77] train-rmse:0.404741 test-rmse:0.411215 
## [78] train-rmse:0.404695 test-rmse:0.411205 
## [79] train-rmse:0.404663 test-rmse:0.411202 
## [80] train-rmse:0.404641 test-rmse:0.411195 
## [81] train-rmse:0.404585 test-rmse:0.411179 
## [82] train-rmse:0.404533 test-rmse:0.411108 
## [83] train-rmse:0.404502 test-rmse:0.411101 
## [84] train-rmse:0.404482 test-rmse:0.411105 
## [85] train-rmse:0.404449 test-rmse:0.411118 
## [86] train-rmse:0.404414 test-rmse:0.411088 
## [87] train-rmse:0.404358 test-rmse:0.411058 
## [88] train-rmse:0.404311 test-rmse:0.411005 
## [89] train-rmse:0.404287 test-rmse:0.411003 
## [90] train-rmse:0.404259 test-rmse:0.410992 
## [91] train-rmse:0.404246 test-rmse:0.410987 
## [92] train-rmse:0.404235 test-rmse:0.410993 
## [93] train-rmse:0.404224 test-rmse:0.411004 
## [94] train-rmse:0.404192 test-rmse:0.411004 
## [95] train-rmse:0.404153 test-rmse:0.411033 
## [96] train-rmse:0.404126 test-rmse:0.411005 
## [97] train-rmse:0.404103 test-rmse:0.410980 
## [98] train-rmse:0.404049 test-rmse:0.410964 
## [99] train-rmse:0.404027 test-rmse:0.410955 
## [100]    train-rmse:0.403987 test-rmse:0.410941 
## [101]    train-rmse:0.403955 test-rmse:0.410913 
## [102]    train-rmse:0.403932 test-rmse:0.410917 
## [103]    train-rmse:0.403901 test-rmse:0.410913 
## [104]    train-rmse:0.403875 test-rmse:0.410875 
## [105]    train-rmse:0.403830 test-rmse:0.410898 
## [106]    train-rmse:0.403803 test-rmse:0.410906 
## [107]    train-rmse:0.403773 test-rmse:0.410914 
## [108]    train-rmse:0.403737 test-rmse:0.410910 
## [109]    train-rmse:0.403703 test-rmse:0.410913 
## [110]    train-rmse:0.403665 test-rmse:0.410906 
## [111]    train-rmse:0.403630 test-rmse:0.410910 
## [112]    train-rmse:0.403593 test-rmse:0.410955 
## [113]    train-rmse:0.403565 test-rmse:0.410981 
## [114]    train-rmse:0.403536 test-rmse:0.410960 
## [115]    train-rmse:0.403508 test-rmse:0.410979 
## [116]    train-rmse:0.403471 test-rmse:0.411002 
## [117]    train-rmse:0.403451 test-rmse:0.411014 
## [118]    train-rmse:0.403425 test-rmse:0.411016 
## [119]    train-rmse:0.403403 test-rmse:0.411034 
## [120]    train-rmse:0.403368 test-rmse:0.411003 
## [121]    train-rmse:0.403339 test-rmse:0.411020 
## [122]    train-rmse:0.403315 test-rmse:0.411044 
## [123]    train-rmse:0.403260 test-rmse:0.411057 
## [124]    train-rmse:0.403231 test-rmse:0.411049 
## [125]    train-rmse:0.403207 test-rmse:0.411012 
## [126]    train-rmse:0.403171 test-rmse:0.411002 
## [127]    train-rmse:0.403145 test-rmse:0.411009 
## [128]    train-rmse:0.403105 test-rmse:0.410996 
## [129]    train-rmse:0.403094 test-rmse:0.411008 
## [130]    train-rmse:0.403066 test-rmse:0.411016 
## [131]    train-rmse:0.403041 test-rmse:0.411007 
## [132]    train-rmse:0.403029 test-rmse:0.411023 
## [133]    train-rmse:0.402994 test-rmse:0.410981 
## [134]    train-rmse:0.402974 test-rmse:0.411011 
## [135]    train-rmse:0.402951 test-rmse:0.411018 
## [136]    train-rmse:0.402926 test-rmse:0.411004 
## [137]    train-rmse:0.402900 test-rmse:0.411033 
## [138]    train-rmse:0.402871 test-rmse:0.411043 
## [139]    train-rmse:0.402842 test-rmse:0.411020 
## [140]    train-rmse:0.402821 test-rmse:0.410996 
## [141]    train-rmse:0.402785 test-rmse:0.411007 
## [142]    train-rmse:0.402762 test-rmse:0.411028 
## [143]    train-rmse:0.402742 test-rmse:0.411020 
## [144]    train-rmse:0.402724 test-rmse:0.411017 
## [145]    train-rmse:0.402699 test-rmse:0.411032 
## [146]    train-rmse:0.402672 test-rmse:0.411056 
## [147]    train-rmse:0.402644 test-rmse:0.411062 
## [148]    train-rmse:0.402624 test-rmse:0.411048 
## [149]    train-rmse:0.402586 test-rmse:0.411047 
## [150]    train-rmse:0.402562 test-rmse:0.411042 
## [151]    train-rmse:0.402546 test-rmse:0.411066 
## [152]    train-rmse:0.402537 test-rmse:0.411069 
## [153]    train-rmse:0.402504 test-rmse:0.411029 
## [154]    train-rmse:0.402484 test-rmse:0.411021 
## [155]    train-rmse:0.402459 test-rmse:0.411013 
## [156]    train-rmse:0.402419 test-rmse:0.411008 
## [157]    train-rmse:0.402395 test-rmse:0.411002 
## [158]    train-rmse:0.402365 test-rmse:0.411008 
## [159]    train-rmse:0.402341 test-rmse:0.411012 
## [160]    train-rmse:0.402322 test-rmse:0.411005 
## [161]    train-rmse:0.402288 test-rmse:0.410977 
## [162]    train-rmse:0.402264 test-rmse:0.410981 
## [163]    train-rmse:0.402253 test-rmse:0.411017 
## [164]    train-rmse:0.402219 test-rmse:0.411021 
## [165]    train-rmse:0.402199 test-rmse:0.411041 
## [166]    train-rmse:0.402169 test-rmse:0.411062 
## [167]    train-rmse:0.402133 test-rmse:0.411017 
## [168]    train-rmse:0.402116 test-rmse:0.411030 
## [169]    train-rmse:0.402101 test-rmse:0.411019 
## [170]    train-rmse:0.402085 test-rmse:0.411031 
## [171]    train-rmse:0.402068 test-rmse:0.411013 
## [172]    train-rmse:0.402063 test-rmse:0.411005 
## [173]    train-rmse:0.402049 test-rmse:0.410991 
## [174]    train-rmse:0.402016 test-rmse:0.410988 
## [175]    train-rmse:0.401996 test-rmse:0.411019 
## [176]    train-rmse:0.401971 test-rmse:0.411016 
## [177]    train-rmse:0.401955 test-rmse:0.411021 
## [178]    train-rmse:0.401915 test-rmse:0.410995 
## [179]    train-rmse:0.401892 test-rmse:0.410989 
## [180]    train-rmse:0.401870 test-rmse:0.410974 
## [181]    train-rmse:0.401851 test-rmse:0.410985 
## [182]    train-rmse:0.401824 test-rmse:0.411010 
## [183]    train-rmse:0.401799 test-rmse:0.410995 
## [184]    train-rmse:0.401784 test-rmse:0.410985 
## [185]    train-rmse:0.401764 test-rmse:0.411004 
## [186]    train-rmse:0.401744 test-rmse:0.410984 
## [187]    train-rmse:0.401718 test-rmse:0.411009 
## [188]    train-rmse:0.401694 test-rmse:0.411009 
## [189]    train-rmse:0.401658 test-rmse:0.410998 
## [190]    train-rmse:0.401638 test-rmse:0.410992 
## [191]    train-rmse:0.401627 test-rmse:0.410995 
## [192]    train-rmse:0.401612 test-rmse:0.411014 
## [193]    train-rmse:0.401583 test-rmse:0.410996 
## [194]    train-rmse:0.401552 test-rmse:0.411002 
## [195]    train-rmse:0.401536 test-rmse:0.411004 
## [196]    train-rmse:0.401521 test-rmse:0.410985 
## [197]    train-rmse:0.401500 test-rmse:0.410992 
## [198]    train-rmse:0.401490 test-rmse:0.410993 
## [199]    train-rmse:0.401484 test-rmse:0.410993 
## [200]    train-rmse:0.401473 test-rmse:0.410994

From the output we can see that the test RMSE bottoms out at around rounds 85-105 (roughly 0.4109) and then plateaus and begins to creep back up, while the training RMSE keeps falling — a sign that further rounds overfit the training data.
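Rather than reading the best round off the log by hand, `xgb.train` can stop itself once the test metric stops improving, via `early_stopping_rounds`. A sketch on synthetic stand-in data (with the real `xgb_train` / `xgb_test` above you would pass them directly):

```r
library(xgboost)

# Synthetic stand-in for xgb_train / xgb_test
set.seed(1)
x <- matrix(rnorm(2000 * 5), ncol = 5)
y <- as.numeric(x[, 1] + rnorm(2000) > 0)
dtrain <- xgb.DMatrix(data = x[1:1500, ], label = y[1:1500])
dtest  <- xgb.DMatrix(data = x[1501:2000, ], label = y[1501:2000])

# stops once the test RMSE has not improved for 10 rounds
fit <- xgb.train(data = dtrain, max.depth = 3, nrounds = 200,
                 watchlist = list(train = dtrain, test = dtest),
                 early_stopping_rounds = 10, verbose = 0)

fit$best_iteration  # round with the lowest test RMSE
```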

Thus, we’ll define our final model to use 85 rounds:

final = xgboost(data = xgb_train, max.depth = 3, nrounds = 85, verbose = 0)
pred_y <- predict(final, test_x)

prediction <- as.numeric(pred_y > 0.5)

mse = mean((test_y - prediction)^2)
print(paste("MSE= ", mse))
## [1] "MSE=  0.24951764486214"
print(paste("RMSE= ", sqrt(mse)))
## [1] "RMSE=  0.499517411970934"
missing_classerr <- mean(prediction != test_y)
print(paste('Accuracy =', 1 - missing_classerr))
## [1] "Accuracy = 0.75048235513786"

As we can see, we now have about 75% accuracy, which is quite good and a slight improvement over the last two models :)
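One caveat: the run above used xgboost's default squared-error objective, effectively treating the 0/1 label as a regression target (hence the RMSE logs). For a binary outcome, `objective = "binary:logistic"` is the more natural choice and makes `predict()` return probabilities directly. A sketch on synthetic stand-in data (with the real data you would pass `xgb_train` and `nrounds = 85`):

```r
library(xgboost)

# Synthetic stand-in for the train/test matrices above
set.seed(1)
x <- matrix(rnorm(2000 * 5), ncol = 5)
y <- as.numeric(x[, 1] + rnorm(2000) > 0)

# logistic objective: boosting minimizes log-loss, not squared error
fit <- xgboost(data = x[1:1500, ], label = y[1:1500],
               objective = "binary:logistic",
               max.depth = 3, nrounds = 50, verbose = 0)

p <- predict(fit, x[1501:2000, ])  # probabilities in (0, 1)
acc <- mean(as.numeric(p > 0.5) == y[1501:2000])
```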

However, I'd also like to try SVM. Let's see how much accuracy we can get with it.

# we will use the same train/test split that was used for the model above

svm_classifier = svm(formula = Diabetes_binary ~ .,
                 data = train_set,
                 type = 'C-classification', # binary classification
                 kernel = 'linear')

y_pred = predict(svm_classifier, newdata = test_set)

# Making the Confusion Matrix
cm = table(test_set$Diabetes_binary, y_pred)
cm
##    y_pred
##        0    1
##   0 5682 2352
##   1 1715 6318
missing_classerr <- mean(y_pred != test_set$Diabetes_binary)
print(paste('Accuracy =', 1 - missing_classerr))
## [1] "Accuracy = 0.746872471525487"

As we can see, the accuracy of the model with the linear kernel is about 74.7%, essentially the same as all the previous models. Let's see what happens if we change the kernel.

svm_classifier = svm(formula = Diabetes_binary ~ .,
                 data = train_set,
                 type = 'C-classification', # binary classification
                 kernel = 'radial')

y_pred = predict(svm_classifier, newdata = test_set)

# Making the Confusion Matrix
cm = table(test_set$Diabetes_binary, y_pred)
cm
##    y_pred
##        0    1
##   0 5603 2431
##   1 1590 6443
missing_classerr <- mean(y_pred != test_set$Diabetes_binary)
print(paste('Accuracy =', 1 - missing_classerr))
## [1] "Accuracy = 0.749735482666335"

As we can see, with the radial kernel the accuracy is about 75%, again essentially unchanged.
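The radial kernel has two hyperparameters, `cost` and `gamma`, which we left at their defaults; e1071's `tune.svm()` cross-validates over a grid of them. A sketch on a built-in dataset with a tiny, hypothetical grid (on the full 50k-row `train_set` this search would be slow):

```r
library(e1071)

# 10-fold CV over a 2x2 grid of gamma and cost values
set.seed(1)
tuned <- tune.svm(Species ~ ., data = iris,
                  gamma = c(0.1, 1), cost = c(1, 10))

tuned$best.parameters             # chosen gamma and cost
cv_err <- tuned$best.performance  # CV misclassification error
```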

now let’s answer the questions:

q1- Yes. Although we only confirmed it by fitting several models, this was expected from the start: we have a large amount of training data and the variables are known diabetes risk factors, so a good model can be fitted.

q2- We used several approaches above to find the variables with the most effect on the outcome; I pointed them out while analyzing the correlation plot. The random forest's variable-importance results likewise show that HighBP, HighChol, BMI, GenHlth and Age carry the most weight.

q3- Yes. As we saw, at least half of these variables contribute little to the model's outcome; for instance, we could keep only about 8 predictors and still get good accuracy.
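As a quick illustration of q3, here is a minimal sketch of a reduced logistic model on just five of the strongest predictors. The columns and coefficients below are synthetic stand-ins for data2; with the real data you would fit the same reduced formula on train_set:

```r
# Simulate a small stand-in dataset with five strong predictors
set.seed(1)
n <- 5000
d <- data.frame(HighBP   = rbinom(n, 1, 0.4),
                HighChol = rbinom(n, 1, 0.4),
                BMI      = rnorm(n, 28, 6),
                GenHlth  = sample(1:5, n, replace = TRUE),
                Age      = sample(1:13, n, replace = TRUE))

# made-up coefficients, loosely echoing the glm summary above
logit <- with(d, -6 + 0.7 * HighBP + 0.6 * HighChol + 0.08 * BMI +
                0.6 * GenHlth + 0.15 * Age)
d$Diabetes_binary <- rbinom(n, 1, plogis(logit))

# reduced model: five predictors instead of twenty-one
small_fit <- glm(Diabetes_binary ~ ., family = "binomial", data = d)
acc <- mean((fitted(small_fit) > 0.5) == d$Diabetes_binary)
```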

q4- In my final models I included all the variables (except in random forest and XGBoost, which handle variable selection differently). I could have eliminated most of them, but doing so neither improved the accuracy nor had any significant effect — and such variable selection is easy to implement in code!

q5- Of course. As noted in the previous answers, some variables can be eliminated and the model will still provide the same accuracy. So, to make the forms shorter and faster, we could ask people only 8 or 9 questions and still give a reasonable estimate of their health condition. Note, however, that diagnosing diabetes is very important and leaves little room for error; even the accuracies we managed to reach here would not be enough in a real-world scenario, and a doctor is required for any further action.

Thanks for reading my markdown!